chore(gastown): promote gastown-staging to main#2974

Merged
jrf0110 merged 14 commits into main from gastown-staging
May 9, 2026

Conversation


@jrf0110 jrf0110 commented Apr 30, 2026

Summary

Batch promote of gastown-staging to main. Ten commits since the last promotion, addressing a mix of bug fixes, observability, performance, a feature, and CI plumbing:

| # | Commit | PR | Type |
|---|---|---|---|
| 1 | 21a14d04 | (no PR; direct push) | dev-only fix |
| 2 | c532f40e | #2999 | bug fix |
| 3 | db584066 | #3047 | bug fix |
| 4 | 47e95a8b | #3055 | observability + race fix |
| 5 | 0ebac0f1 | #3072 | CI fix |
| 6 | e20d04f7 | #3074 | CI fix |
| 7 | 62e1f135 | #3113 | feature |
| 8 | 47933737 | #3119 | bug fix |
| 9 | 9ca30347 | #3122 | performance |
| 10 | c5d416dc | #3138 | bug fix + observability |

Constituent changes

  • fix(gastown): point dev GIT_TOKEN_SERVICE binding at git-token-service-dev (21a14d04, no PR — direct push)
    Local-dev binding fix. services/git-token-service/wrangler.jsonc overrides its worker name to git-token-service-dev in env.dev, but gastown's env.dev.services binding still referenced the base git-token-service name. Wrangler's local dev registry does exact-name matching, so the binding showed [not connected] whenever both workers ran side by side. Production binding (top-level services) is untouched; only the env.dev block was changed.

  • fix(gastown): push new model onto resumed mayor session on hot-swap (#2999)
    Regression where changing the mayor's model in town settings persisted the config but the running mayor kept using the previous model — users had to run /model manually. Root cause: a prior fix (9785570b9) skipped the session.prompt(...) call on resumed sessions to avoid a duplicate startup turn, but that call was also responsible for pushing the new model field onto the session. Fix sends a session.prompt with noReply: true and the new model on resume, so the model updates without injecting a synthetic user turn.

  • fix(gastown): stop reconciler log spam from orphaned bead_cancelled events (#3047)
    Production logs were filling with reconciler: applyEvent failed … Bead <id> not found repeating every alarm tick forever. Two cooperating bugs: (1) deleteBead cascaded cleanup to satellite tables but not the town_events reconciler queue, leaving orphan events; (2) the drain loop in Town.do.ts intentionally never marked failed events as processed, expecting retries to eventually succeed — but for a deleted bead they never can. Fix cleans up town_events in deleteBead/deleteBeads, pre-checks bead existence in applyEvent, and classifies Bead/Agent … not found errors as terminal in the drain loop.

  • feat(gastown-container): add crash visibility + per-agent start mutex (#3055)
    Investigation hooks for repeated container restarts seen on a specific town (~1.5–2 min cadence). Adds an unhandledRejection listener with full stack logging (no process.exit), periodic RSS memory logging via the heartbeat path, and a per-agentId mutex in startAgent that fixes a real concurrency race exposed by duplicate /agents/start log lines arriving in the same millisecond for the same mayor.

  • chore(gastown): fix format and lint CI failures (#3072)
    Plumbing fix for this batch PR. services/gastown/src/dos/Town.do.ts and services/gastown/test/integration/event-cleanup.test.ts needed oxfmt formatting, and String(promise) in services/gastown/container/src/main.ts (added by #3055's unhandledRejection handler) tripped no-base-to-string because a Promise stringifies to [object Promise]. Replaced with a string marker in the log payload.

  • chore(gastown): drop unused promise param from unhandledRejection handler (#3074)
    Follow-up to #3072. Removing the String(promise) call left the promise parameter declared but unused, tripping a different lint rule. Drops the parameter entirely; the rejection reason is already captured in the reason field.

  • feat(gastown): add proactive idle-container stop in TownDO alarm (#3113)
    When a town has no active work and the mayor has been idle for >5 minutes, the alarm now calls container.stop() to force a graceful SIGTERM drain instead of waiting for Cloudflare's port-idle timer. PTY WebSocket keep-alives from open dashboard terminals were resetting the port-idle timer indefinitely, leaving 300+ active containers running for ~100 active users. Targets the root cause of that container-count bloat.

  • fix(onboarding): redirect back to onboarding after GitHub app install (#3119)
    When a new gastown user without the Kilo GitHub app clicked "Install GitHub App" during the repo step of onboarding, the GitHub callback always redirected to /integrations/github?success=installed, dropping the user out of the wizard. Extends the GitHub install URL's state parameter with an optional |return=<urlencoded-path> suffix; the callback validates it (same-origin internal path; rejects //, backslash, CRLF) and redirects there with ?github_install=success when present. Onboarding now resumes on the repo step with the newly-installed repos available.

  • feat(gastown): speed up mayor session startup (#3122)
    Three independent optimizations targeting "connection timeout on first nav" failures and slow mayor startup.
    (1) Prewarm the mayor SDK server in bootHydration: eagerly hydrate kilo.db and start kilo serve even when the mayor wasn't in the registry (e.g. idle-stopped via #3113), collapsing the sdk_ready phase from 2–6 s to <50 ms on warm restarts.
    (2) Seed the getMayorStatus cache directly from ensureMayor's response instead of waiting for the next 3 s poll tick.
    (3) Detect a torn-down SDK in _ensureMayor's short-circuit: when the container reports the mayor as running but serverPort=0 or sessionId='', fall through to a fresh dispatch instead of returning early (eliminates the "refresh fixes it" class of failures).
    Also adds AE telemetry: mayor.prewarm_complete, agent.startup_phase, mayor.ensure_decision.

  • feat(gastown): stop convoy landing-MR respawn loop when PR awaiting approval (#3138)
    Convoy landing PRs that were waiting on human approval (branch protection requires it) were piling up 5 failed merge_request beads against a single green PR. The reconciler treated GitHub's mergeable_state='blocked' as allGreen=true, tried to merge, got 405, and respawned the MR — up to MAX_LANDING_MR_ATTEMPTS = 5. Fix extends the existing checkPRFeedback GraphQL query with reviewDecision and mergeStateStatus, derives awaitingApproval, gates allGreen on it, and suppresses both per-MR failure transitions and convoy-level respawns when the PR is structurally green but waiting on humans. Pushed back on a Workers AI proposal here — reviewDecision is a structured GitHub field that answers the question deterministically; LLM is the right tool for areThreadsBlocking (interpreting prose) but the wrong tool for "should I attempt a merge?". Adds AE events pr.awaiting_approval_detected, pr.awaiting_approval_resolved, reconciler.respawn_suppressed with corresponding panels on the gastown ops Grafana dashboard.

Verification

  • Each constituent PR was reviewed and merged into gastown-staging independently — see PR links above for per-change verification details.
  • 21a14d04 was manually verified locally (wrangler dev --env dev for both workers, binding now connects).

Reviewer notes

  • Changes are scoped to services/gastown (container + DO/worker code), apps/web/src/app/(app)/gastown/... (onboarding wizard + terminal bar), apps/web/src/app/api/integrations/github/callback/route.ts (install callback), and the env.dev block of services/gastown/wrangler.jsonc.
  • Production worker bindings are unchanged.
  • The GitHub callback change in #3119 is fully backwards-compatible with old in-flight installs (state values without a |return=… suffix fall through to the existing default redirect).
  • #3138 is also backwards-compatible: bead metadata gains an awaiting_approval field that's 0 for old beads, and the new logic only activates when poll_pr writes 1 to it.

Outstanding review comments

PR #2974 has three open WARNING-level review comments from kilo-code-bot that need addressing before merge. A fixup bead has been slung — see follow-up PR for the fix.

…e-dev

git-token-service's wrangler env.dev overrides the worker name to
'git-token-service-dev', but gastown's env.dev.services binding was
still referencing the base 'git-token-service' name. Wrangler's local
dev registry does exact-name matching, so the binding showed as
[not connected] whenever both workers were running side by side.

Every other consumer in the repo (cloud-agent-next, security-sync,
security-auto-analysis) already uses 'git-token-service-dev' in their
env.dev block; gastown was the outlier.

kilo-code-bot Bot commented Apr 30, 2026

Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
|---|---|
| CRITICAL | 0 |
| WARNING | 2 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
|---|---|---|
| N/A | N/A | N/A |


Other Observations (not in diff)

Issues found in unchanged code or files outside the PR diff that cannot receive inline comments:

| File | Line | Issue |
|---|---|---|
| .specs/kiloclaw-referrals.md | 186 | Referral touch validity is inverted: the spec says conversion_time < touched_at - 30 * 24 hours instead of using the plus-30-day expiry cutoff described by the definition. |
| AGENTS.md | 115 | The specs table still points to .specs/impact-affiliate-tracking.md, but this branch renames that spec to .specs/kiloclaw-affiliates.md, leaving the repo guidance with a broken path. |

Resolved Previous Findings

| File | Line | Issue |
|---|---|---|
| apps/web/src/app/(app)/gastown/onboarding/OnboardingStepRepo.tsx | 68 | GitHub install now has maintainer follow-up indicating the undefined user case was fixed. |
| services/gastown/container/src/process-manager.ts | 2686 | Mayor prewarm workspace creation was marked fixed by maintainer follow-up. |
| services/gastown/src/dos/town/actions.ts | 1112 | Draft PR auto-merge gating was marked fixed by maintainer follow-up. |

Files Reviewed (25 files)
  • apps/web/src/app/(app)/gastown/onboarding/OnboardingStepRepo.tsx - 0 issues
  • apps/web/src/app/(app)/gastown/onboarding/OnboardingWizardClient.tsx - 0 issues
  • apps/web/src/app/api/integrations/github/callback/route.ts - 0 issues
  • apps/web/src/components/gastown/MayorChat.tsx - 0 issues
  • apps/web/src/components/gastown/TerminalBar.tsx - 0 issues
  • apps/web/src/lib/integrations/validate-return-path.ts - 0 issues
  • apps/web/src/lib/integrations/validate-return-path.test.ts - 0 issues
  • services/gastown/wrangler.jsonc - 0 issues
  • services/gastown/container/src/agent-runner.ts - 0 issues
  • services/gastown/container/src/main.ts - 0 issues
  • services/gastown/container/src/process-manager.ts - 0 issues
  • services/gastown/container/src/process-manager.test.ts - 0 issues
  • services/gastown/container/vitest.config.ts - 0 issues
  • services/gastown/src/dos/Town.do.ts - 0 issues
  • services/gastown/src/dos/town/actions.ts - 0 issues
  • services/gastown/src/dos/town/beads.ts - 0 issues
  • services/gastown/src/dos/town/container-dispatch.ts - 0 issues
  • services/gastown/src/dos/town/container-idle-stop.ts - 0 issues
  • services/gastown/src/dos/town/container-idle-stop.test.ts - 0 issues
  • services/gastown/src/dos/town/reconciler.ts - 0 issues
  • services/gastown/src/dos/town/town-scm.ts - 0 issues
  • services/gastown/src/gastown.worker.ts - 0 issues
  • services/gastown/test/integration/awaiting-approval.test.ts - 0 issues
  • .specs/kiloclaw-referrals.md - 1 issue (outside PR diff)
  • AGENTS.md - 1 issue (outside PR diff)

Reviewed by gpt-5.5-2026-04-23 · 2,885,116 tokens

jrf0110 and others added 3 commits May 1, 2026 16:22
…2999)

When a user changes the mayor's model in town settings, updateAgentModel
restarts the SDK server with new KILO_CONFIG_CONTENT and resumes the
existing session from kilo.db. Commit 9785570 intentionally stopped
sending any session.prompt on resume to avoid duplicating the
MAYOR_STARTUP_PROMPT, but that also dropped the model param — so the
resumed session kept its prior per-session model until the user ran
/model manually.

Extract the fresh vs. resumed session-prompt logic into applyModelToSession
and on resume send a noReply:true prompt carrying only the new model
param. This updates the SDK server's per-session model without replaying
the startup prompt. Errors on the resume path are swallowed so the
hot-swap still succeeds; the SDK server fell back to the config-loaded
model at startup, which was already updated.

Add container tests covering both fresh and resumed paths.
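
The fresh vs. resumed split described above can be sketched roughly as follows. This is an illustration only: the `SessionClient` interface, the `PromptRequest` shape, and the function signature are hypothetical stand-ins, since the real SDK types are not shown in this PR.

```typescript
// Hypothetical request/client shapes; the real SDK types are not shown here.
interface PromptRequest {
  model: string;
  noReply?: boolean;
  text?: string;
}

interface SessionClient {
  prompt(req: PromptRequest): Promise<void>;
}

// Fresh sessions get the startup prompt. Resumed sessions get a model-only,
// noReply prompt so the model changes without a synthetic user turn.
async function applyModelToSession(
  client: SessionClient,
  opts: { resumed: boolean; model: string; startupPrompt: string },
): Promise<void> {
  if (!opts.resumed) {
    await client.prompt({ model: opts.model, text: opts.startupPrompt });
    return;
  }
  try {
    await client.prompt({ model: opts.model, noReply: true });
  } catch (err) {
    // Swallowed on purpose: the hot-swap should still succeed, because the
    // server already loaded the updated config at startup.
    console.warn("model push on resume failed", err);
  }
}
```

Swallowing the error only on the resume path keeps a fresh-session failure visible while letting a hot-swap degrade to the config-loaded model.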

Co-authored-by: John Fawcett <john@kilcoode.ai>
…vents (#3047)

Two independent bugs compose to flood production logs every alarm tick
with 'Bead <id> not found' errors:

1. deleteBead / deleteBeads did not clean up the town_events queue,
   leaving bead_cancelled and container_status rows pointing at deleted
   beads/agents.
2. applyEvent threw on missing beads and the drain loop never marked
   the failing event processed — so it retried forever.

Fix 1: purge town_events rows (by bead_id OR agent_id, since agents are
beads) from deleteBead and the deleteBeads bulk path.

Fix 2a: reconciler.applyEvent('bead_cancelled') checks for the target
bead up front and returns (with a warn) when it's missing, instead of
throwing.

Fix 2b: the Town.do.ts drain loop recognises 'Bead/Agent <uuid> not
found' terminal errors, logs them at warn, and marks the offending
event processed so it stops retrying.

Adds debug RPCs (debugTownEvents, debugInsertTownEvent,
debugRecordContainerStatus) and integration coverage in
event-cleanup.test.ts.
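
A minimal sketch of Fix 2b, assuming the error message format quoted in the log line above. The `TownEvent` shape, `isTerminalEventError`, and `drainEvents` are hypothetical names for illustration, not the code in Town.do.ts.

```typescript
// Matches "Bead <uuid> not found" / "Agent <uuid> not found". The exact
// message format is an assumption based on the quoted production log line.
const NOT_FOUND_RE =
  /\b(Bead|Agent) [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} not found\b/i;

function isTerminalEventError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return NOT_FOUND_RE.test(message);
}

// Hypothetical in-memory stand-in for a town_events row.
interface TownEvent {
  id: string;
  processed: boolean;
}

// One drain pass: successes and terminal failures are marked processed;
// any other failure is left unprocessed so the next alarm tick retries it.
function drainEvents(events: TownEvent[], applyEvent: (e: TownEvent) => void): void {
  for (const event of events) {
    if (event.processed) continue;
    try {
      applyEvent(event);
      event.processed = true;
    } catch (err) {
      if (isTerminalEventError(err)) {
        console.warn("dropping event with terminal error", event.id);
        event.processed = true;
      }
    }
  }
}
```

Matching on the UUID as well as the "not found" wording keeps the terminal class narrow, so unrelated failures still get retried.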

Co-authored-by: John Fawcett <john@kilcoode.ai>
…#3055)

* feat(gastown-container): add crash visibility + per-agent start mutex

Diagnostic changes to investigate frequent container restarts for town
4d82f099-ccb7-4eaf-8676-73562e0a27eb (~1.5–2 min boot-hydration loops).

- main.ts: add unhandledRejection listener that logs full error/stack
  without exiting (Bun/Node silently drop rejections without a handler,
  making fire-and-forget failures like void saveDbSnapshot()/void
  subscribeToEvents() invisible). Include uptime and active-agent count
  for correlation.
- main.ts: improve uncaughtException log with name/uptime/agent count.
- main.ts: 30s periodic container.memory_usage log (rss/heap/external)
  so OOM-class failures (external SIGKILL from Cloudflare Containers
  runtime when the memory ceiling is hit) become observable — these
  leave no exception behind.
- main.ts: wrap bootHydration() in try/catch so a rare synchronous throw
  before the first await doesn't crash the process.
- process-manager.ts: add per-agentId mutex for startAgent. Production
  logs show two /agents/start requests for the same agentId logged at
  the same millisecond; both pass the re-entrancy check before either
  commits a 'starting' record, then race on startupAbortController,
  session creation, idle timers, and SDK sessionCount. Serialising
  per agentId makes the re-entrant path observe a consistent snapshot.
- process-manager.test.ts: three tests for the mutex — same-id
  serialisation, different-id concurrency, lock release on throw.

* fix(container): replace Promise.withResolvers with explicit new Promise

Promise.withResolvers is a newer API not available on older Bun
runtimes. Since process-manager.ts is imported during container
startup, a missing global would throw before crash handlers are
registered and prevent the control server from starting. Use the
same explicit new Promise pattern as the existing sdkServerLock.

* feat(gastown/container): include townId in crash and memory logs

Per review feedback, attach the container's GASTOWN_TOWN_ID to
unhandled_rejection, uncaught_exception, cold_start, memory_usage,
and boot_hydration_failed log entries so production crash logs can
be correlated with a specific town without needing to also have an
agent registered.
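
For reference, a per-key mutex of the kind described for startAgent can be sketched as below, using an explicit `new Promise` rather than `Promise.withResolvers`. `KeyedMutex` is a hypothetical name, and this is a sketch of the general technique, not the code in process-manager.ts.

```typescript
// Per-key async mutex: calls sharing a key run one at a time,
// while calls with different keys run concurrently.
class KeyedMutex {
  private tails = new Map<string, Promise<void>>();

  async run<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    let release: () => void = () => {};
    // Explicit new Promise instead of Promise.withResolvers, for older runtimes.
    const gate = new Promise<void>((resolve) => {
      release = resolve;
    });
    // The new tail resolves only after every earlier holder and this one release.
    const tail = previous.then(() => gate);
    this.tails.set(key, tail);
    await previous;
    try {
      return await fn();
    } finally {
      release(); // released even when fn throws
      // Drop the entry if no later caller queued behind this one.
      if (this.tails.get(key) === tail) this.tails.delete(key);
    }
  }
}
```

Usage would look like `await mutex.run(agentId, () => startAgentImpl(agentId))`, where `startAgentImpl` stands for whatever work must not interleave for a single agent.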

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
chore(gastown): fix format and lint CI failures on staging

Co-authored-by: John Fawcett <john@kilcoode.ai>
jrf0110 and others added 3 commits May 6, 2026 10:56
…dler (#3074)

Co-authored-by: John Fawcett <john@kilcoode.ai>
* feat(gastown): add proactive idle-container stop in TownDO alarm

When a town has no active work and the mayor has been idle for >5min,
the alarm handler now calls container.stop() to force a graceful SIGTERM
drain instead of waiting for Cloudflare's port-idle timer (which gets
reset by PTY WebSocket keep-alives). This targets the root cause of
300+ active containers for ~100 active users.

- Add stopContainerIfIdle() with dependency-injected logic in
  town/container-idle-stop.ts for testability
- Wire into alarm handler just before the re-arm block
- Emit container.idle_stop event with reason for observability
- 2min throttle prevents thrash; failed stops are retried next tick
- 13 unit tests covering all branches

* fix(gastown): remove non-null assertions from container-idle-stop

Replace mayor.last_activity_at! with null-safe checks using
mayor.last_activity_at != null guards, consistent with coding
style that forbids ! non-null assertions.

* fix: allow stopping healthy containers in idle-stop guard

The state guard only checked for 'running', but containers can also
report 'healthy' as an active state (consistent with gastown.worker.ts).
Added 'healthy' to the guard and a corresponding test.
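
The idle-stop decision can be pictured as a pure function over injected inputs, which is what makes it easy to unit test. The sketch below is an assumption-laden illustration: the field names, reason strings, and the choice to skip a mayor with no activity timestamp are hypothetical, and only the 5-minute idle threshold, the 2-minute throttle, and the 'running'/'healthy' guard come from the commit messages above.

```typescript
const IDLE_THRESHOLD_MS = 5 * 60_000; // mayor idle for more than 5 minutes
const STOP_THROTTLE_MS = 2 * 60_000; // at most one stop every 2 minutes

// Hypothetical input shape; all timestamps are epoch milliseconds.
interface IdleStopInput {
  now: number;
  containerState: string;
  hasActiveWork: boolean;
  mayorLastActivityAt: number | null;
  lastStopAt: number | null; // last successful stop, if any
}

type IdleStopDecision = { stop: true } | { stop: false; reason: string };

function shouldStopIdleContainer(input: IdleStopInput): IdleStopDecision {
  // Both 'running' and 'healthy' count as active container states.
  if (input.containerState !== "running" && input.containerState !== "healthy") {
    return { stop: false, reason: "container_not_active" };
  }
  if (input.hasActiveWork) return { stop: false, reason: "active_work" };
  // Assumption: with no recorded activity, leave the container alone.
  if (input.mayorLastActivityAt == null) {
    return { stop: false, reason: "no_activity_timestamp" };
  }
  if (input.now - input.mayorLastActivityAt <= IDLE_THRESHOLD_MS) {
    return { stop: false, reason: "mayor_recently_active" };
  }
  if (input.lastStopAt != null && input.now - input.lastStopAt < STOP_THROTTLE_MS) {
    return { stop: false, reason: "throttled" };
  }
  return { stop: true };
}
```

Recording `lastStopAt` only after a successful stop would give the "failed stops are retried next tick" behaviour, since a failure never arms the throttle.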

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
…#3119)

* feat(onboarding): redirect back to onboarding after GitHub app install

* fix: address PR review feedback — stable effect deps, parsed error state, URIError guard

- OnboardingStepRepo: use stable refetch reference and scalar param
  instead of full query object to prevent duplicate toasts/refetches
- GitHub callback: parse owner token from state in error handler so
  |return= suffix doesn't leak into org redirect URLs
- validate-return-path: catch URIError from malformed percent-encoding
  and treat as invalid return path (null) instead of throwing
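
A rough sketch of the return-path handling, assuming the `<owner>|return=<urlencoded-path>` state format from the PR description. The function names mirror the file name but are otherwise hypothetical, and the exact rejection rules in validate-return-path.ts may differ (for instance, this sketch only rejects a leading `//`).

```typescript
// Returns a safe internal path, or null when the value should be ignored.
function validateReturnPath(encoded: string): string | null {
  let path: string;
  try {
    path = decodeURIComponent(encoded);
  } catch {
    return null; // URIError from malformed percent-encoding
  }
  if (!path.startsWith("/")) return null; // must be an internal absolute path
  if (path.startsWith("//")) return null; // protocol-relative URL
  if (/[\\\r\n]/.test(path)) return null; // backslash or CR/LF
  return path;
}

// Splits the state into the owner token and an optional validated return path.
function parseInstallState(state: string): { owner: string; returnPath: string | null } {
  const marker = "|return=";
  const idx = state.indexOf(marker);
  if (idx === -1) return { owner: state, returnPath: null };
  return {
    owner: state.slice(0, idx),
    returnPath: validateReturnPath(state.slice(idx + marker.length)),
  };
}
```

Parsing the owner token separately from the suffix is what keeps `|return=` out of the error-path redirect, and a state value without the suffix yields `returnPath: null`, which maps to the existing default redirect.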

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
Comment thread apps/web/src/app/(app)/gastown/onboarding/OnboardingStepRepo.tsx
* feat(gastown): prewarm mayor SDK server in bootHydration

Add mayor SDK server prewarming to bootHydration so the mayor's kilo
serve instance is already running when the user's first /agents/start
arrives after a container restart. Previously, the mayor was only
resumed if it was in the registry (running/starting at shutdown), but
idle-stop and stream-error teardowns leave the mayor unregistered.

- Export mayorWorkdirForTown() from agent-runner.ts
- Add prewarmMayorSDK() to process-manager.ts that fetches the mayor
  agent ID from a new worker endpoint, hydrates kilo.db from KV
  snapshot, and starts the SDK server
- Add GET /api/towns/:townId/mayor-id endpoint to gastown.worker.ts
  (uses authMiddleware like container-registry/db-snapshot)
- Add getMayorAgentId() RPC method to Town.do.ts
- Add warm-cache detection in startAgentImpl: log phaseMs: 0 and
  prewarmed: true when the SDK instance was already cached
- bootHydration no longer returns early on empty registry so the
  mayor prewarm always runs

* feat(gastown): seed getMayorStatus cache from ensureMayor response

Instead of only invalidating the getMayorStatus query after ensureMayor
succeeds (which forces a 3s polling wait before useXtermPty can start
connecting), seed the React Query cache directly from the mutation
result. The agentId and sessionStatus are already available in the
ensureMayor response, so the terminal can begin connecting within
~50ms instead of waiting for the next poll tick.

Still invalidate after seeding so the next poll catches up to
authoritative state.

* fix(gastown): detect torn-down SDK in _ensureMayor short-circuit

When the container reports the mayor as 'running'/'starting' but the
SDK instance has no serverPort or sessionId (torn down after stream
errors or drain), _ensureMayor now falls through to a fresh dispatch
instead of returning early. This eliminates the 'refresh fixes it'
class of failures where the PTY gets a 404 because there's no SDK
port to attach to.

Also extend checkAgentContainerStatus to surface serverPort and
sessionId from the container's agent status response.

* feat(gastown): add AE telemetry events for mayor startup optimization

Add three Analytics Engine event streams to measure the impact of the
mayor startup optimizations:

1. agent.startup_phase — emitted for db_hydrated, sdk_ready, and
   session_created phases. Includes elapsedMs and phaseMs so we can
   P50/P95 per-town. The sdk_ready event includes phaseMs: 0 when
   the SDK was prewarmed (warm-cache hit).

2. mayor.prewarm_complete — emitted when the mayor SDK server is
   prewarmed during bootHydration, with durationMs.

3. mayor.ensure_decision — tracks the _ensureMayor decision tree:
   short_circuit_warm, short_circuit_idle, sdk_dead_redispatch, or
   fresh_dispatch. Measures the rate of the SDK-dead case that Change
   3 fixes.

Container-side events are proxied to AE via a new
POST /api/towns/:townId/container-events worker endpoint, since the
container can't call writeEvent directly.

* test(gastown): add integration test for torn-down-SDK fall-through

Test that _ensureMayor falls through when the container status doesn't
indicate a live SDK (no serverPort or sessionId). Covers:

1. Container not available in test env (baseline behavior)
2. sdkAlive validation logic: zero port, empty session, valid values
3. checkAgentContainerStatus returns 404 for unknown agents

* fix(gastown): set per-agent KILO_TEST_HOME/XDG_DATA_HOME in prewarm env

The prewarm function was copying KILO_TEST_HOME and XDG_DATA_HOME from
process.env, but those are typically absent at the container level. Normal
agent startup sets them per-agent via buildAgentEnv(). Without these, the
prewarmed SDK server boots against the default data directory and bypasses
the hydrated kilo.db snapshot.

Now buildPrewarmEnv() sets KILO_TEST_HOME and XDG_DATA_HOME based on the
mayorAgentId, matching what buildAgentEnv() does for regular agents.

* fix(gastown): generate KILO_CONFIG_CONTENT in prewarm env and handle config mismatch on cache hit

Prewarm now generates KILO_CONFIG_CONTENT/OPENCODE_CONFIG_CONTENT using
buildKiloConfigContent() with the kilocode token and default models
instead of copying them from process.env (where they're absent on cold
start). When ensureSDKServer() finds a cached instance whose config
differs from the incoming env, it evicts the old server and creates a
new one so the SDK picks up the correct model/provider config. Also
extracts PERSIST_ENV_KEYS to module-level and updates process.env for
those keys on cache hit when configs match.
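
The torn-down-SDK check can be sketched as a small decision function. The status shape and function names here are hypothetical; the decision labels are taken from the mayor.ensure_decision telemetry above, with short_circuit_idle left out because its trigger is not described in these commit messages.

```typescript
// Hypothetical subset of the container's agent status response.
interface AgentContainerStatus {
  status: string; // e.g. 'running' or 'starting'
  serverPort?: number;
  sessionId?: string;
}

// The SDK counts as alive only with a non-zero port and a non-empty session id.
function sdkAlive(s: AgentContainerStatus): boolean {
  return (
    typeof s.serverPort === "number" &&
    s.serverPort > 0 &&
    typeof s.sessionId === "string" &&
    s.sessionId.length > 0
  );
}

type EnsureDecision = "short_circuit_warm" | "sdk_dead_redispatch" | "fresh_dispatch";

function decideEnsureMayor(status: AgentContainerStatus | null): EnsureDecision {
  if (status === null) return "fresh_dispatch";
  const reportedActive = status.status === "running" || status.status === "starting";
  if (!reportedActive) return "fresh_dispatch";
  // Reported as active but with no usable SDK: fall through to a new dispatch.
  return sdkAlive(status) ? "short_circuit_warm" : "sdk_dead_redispatch";
}
```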

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
Comment thread services/gastown/container/src/process-manager.ts
…pproval (#3138)

* feat(gastown): stop convoy landing-MR respawn loop when PR awaiting approval

Add GitHub reviewDecision and mergeStateStatus to PR feedback checks,
gating auto-merge on awaitingApproval/changesRequested flags. Persist
awaiting_approval in MR bead metadata, suppress Rule 4 timeout and
Rule 6 refinery re-dispatch when set, and prevent convoy-level respawn
of landing MRs for approval-blocked PRs.

Fixes: 7ff83865

* fix(gastown): hoist reviewDecision/mergeStateStatus/isDraft outside if block

Block-scoped variables were used outside their declaration scope,
causing TypeScript compilation failure. Declared with let and
default values before the try block, assigned inside the if block.
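
One way to picture the approval gate is as a pure derivation over the PR feedback fields. The enum values below are GitHub's GraphQL `PullRequestReviewDecision` and `MergeStateStatus` values; the `PrFeedback` shape, the function name, and the exact rule combining the two fields are assumptions for illustration, not the logic in actions.ts.

```typescript
type ReviewDecision = "APPROVED" | "CHANGES_REQUESTED" | "REVIEW_REQUIRED" | null;
type MergeStateStatus =
  | "BEHIND" | "BLOCKED" | "CLEAN" | "DIRTY"
  | "DRAFT" | "HAS_HOOKS" | "UNKNOWN" | "UNSTABLE";

// Hypothetical summary of what a PR feedback poll returns.
interface PrFeedback {
  checksPassing: boolean;
  hasBlockingThreads: boolean;
  isDraft: boolean;
  reviewDecision: ReviewDecision;
  mergeStateStatus: MergeStateStatus;
}

interface MergeGate {
  awaitingApproval: boolean;
  changesRequested: boolean;
  allGreen: boolean;
}

function deriveMergeGate(pr: PrFeedback): MergeGate {
  const structurallyGreen = pr.checksPassing && !pr.hasBlockingThreads && !pr.isDraft;
  // Assumed rule: green, but blocked purely on a required human review.
  const awaitingApproval =
    structurallyGreen &&
    pr.mergeStateStatus === "BLOCKED" &&
    pr.reviewDecision === "REVIEW_REQUIRED";
  const changesRequested = pr.reviewDecision === "CHANGES_REQUESTED";
  return {
    awaitingApproval,
    changesRequested,
    // Only attempt a merge when nothing is waiting on a human.
    allGreen: structurallyGreen && !awaitingApproval && !changesRequested,
  };
}
```

With a gate like this, a caller can skip both the merge attempt and the failure transition whenever `awaitingApproval` is true, so a green PR that is waiting on review no longer burns landing-MR attempts.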

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
Comment thread services/gastown/src/dos/town/actions.ts
@jrf0110 jrf0110 merged commit 32881b7 into main May 9, 2026
14 checks passed
@jrf0110 jrf0110 deleted the gastown-staging branch May 9, 2026 16:20
kilo-code-bot Bot pushed a commit that referenced this pull request May 10, 2026
…ayor tools on prewarm

Three independent fixes for the startAgentInContainer timeout
regression introduced by #2974, plus a tighter container-instance cap.

1. Hydration gate (control-server.ts, process-manager.ts)
   The control server starts accepting requests immediately at boot,
   while bootHydration runs concurrently and serialises every registry
   agent + the mayor prewarm through the global sdkServerLock. Fresh
   /agents/start, /refresh-token, and PATCH /agents/:id/model requests
   queued behind that work and the DO-side AbortSignal.timeout(60s)
   fired before they ever got the lock — surfacing as
   "TimeoutError: aborted due to timeout" and "timeout after 6000ms:
   ensureSDKServer for <agentId>". A new awaitHydration() promise is
   awaited at the top of those handlers (before any process.env
   mutation in the model PATCH path) so they don't compound the queue.

2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts,
   process-manager.ts)
   buildPrewarmEnv was constructing KILO_CONFIG_CONTENT from hardcoded
   defaults (anthropic/claude-sonnet-4.6 / claude-haiku-4.5), so the
   real /agents/start with the user's actual model triggered
   ensureSDKServer's "config mismatch, evicting prewarmed server" path
   on every warm restart — doubling lock-holding time on the critical
   path the prewarm was supposed to speed up. The /api/towns/:id/mayor-id
   endpoint now returns the full prewarm context (model, smallModel,
   kilocodeToken, organizationId) resolved the same way _ensureMayor
   resolves it, and the container builds the prewarm KILO_CONFIG_CONTENT
   to match. Falls back gracefully to a skip when the worker hasn't
   deployed the richer endpoint yet.

3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)
   prewarmMayorSDK called mayorWorkdirForTown (which only returns a
   string) and went straight to ensureSDKServer's process.chdir,
   throwing ENOENT on cold containers because createMayorWorkspace
   only ran from runAgent. Exported ensureMayorWorkspaceForTown so
   prewarm materialises the workspace first.

   More critically, buildPrewarmEnv was missing GASTOWN_AGENT_ROLE,
   GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID — env vars the kilo serve
   plugin (plugin/index.ts) reads at spawn to decide whether to
   register mayor tools. Without them the prewarmed server booted with
   NO mayor tools, and the cache hit on the next /agents/start handed
   that defective instance back to the user. Now mirrors the mayor-
   shaped subset of buildAgentEnv. Added an end-to-end test that
   intercepts createKilo and asserts the env at spawn time.

4. wrangler.jsonc: lower TownContainerDO max_instances from 800 to 500.

Verified with pnpm --filter gastown-container test (67/67 pass),
pnpm --filter cloudflare-gastown typecheck, oxlint, and pnpm format.
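
The hydration gate in item 1 amounts to a one-shot promise that handlers await before touching shared state. A minimal sketch, with the `HydrationGate` class and the `withHydrationGate` helper as hypothetical names rather than the code in control-server.ts:

```typescript
// One-shot gate: request handlers await it, boot code opens it exactly once.
class HydrationGate {
  private readonly ready: Promise<void>;
  private markReady: () => void = () => {};

  constructor() {
    this.ready = new Promise<void>((resolve) => {
      this.markReady = resolve;
    });
  }

  // Call when boot hydration finishes, ideally from a finally block so that
  // a failed hydration cannot leave handlers waiting forever.
  complete(): void {
    this.markReady();
  }

  awaitHydration(): Promise<void> {
    return this.ready;
  }
}

// Wraps a handler so its body runs only after hydration has completed.
function withHydrationGate<A extends unknown[], R>(
  gate: HydrationGate,
  handler: (...args: A) => Promise<R>,
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    await gate.awaitHydration();
    return handler(...args);
  };
}
```

Awaiting the gate before any other work means a request never takes a place in the lock queue while hydration still holds it, which is the compounding effect the commit message describes.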
jrf0110 added a commit that referenced this pull request May 10, 2026
* chore(gastown): remove manual request logging middleware

* fix(gastown): unblock /agents/start during boot hydration; preserve mayor tools on prewarm

Three independent fixes for the startAgentInContainer timeout
regression introduced by #2974, plus a tighter container-instance cap.

1. Hydration gate (control-server.ts, process-manager.ts)
   The control server starts accepting requests immediately at boot,
   while bootHydration runs concurrently and serialises every registry
   agent + the mayor prewarm through the global sdkServerLock. Fresh
   /agents/start, /refresh-token, and PATCH /agents/:id/model requests
   queued behind that work and the DO-side AbortSignal.timeout(60s)
   fired before they ever got the lock — surfacing as
   "TimeoutError: aborted due to timeout" and "timeout after 6000ms:
   ensureSDKServer for <agentId>". A new awaitHydration() promise is
   awaited at the top of those handlers (before any process.env
   mutation in the model PATCH path) so they don't compound the queue.

2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts,
   process-manager.ts)
   buildPrewarmEnv was constructing KILO_CONFIG_CONTENT from hardcoded
   defaults (anthropic/claude-sonnet-4.6 / claude-haiku-4.5), so the
   real /agents/start with the user's actual model triggered
   ensureSDKServer's "config mismatch, evicting prewarmed server" path
   on every warm restart — doubling lock-holding time on the critical
   path the prewarm was supposed to speed up. The /api/towns/:id/mayor-id
   endpoint now returns the full prewarm context (model, smallModel,
   kilocodeToken, organizationId) resolved the same way _ensureMayor
   resolves it, and the container builds the prewarm KILO_CONFIG_CONTENT
   to match. Falls back gracefully to a skip when the worker hasn't
   deployed the richer endpoint yet.

3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)
   prewarmMayorSDK called mayorWorkdirForTown (which only returns a
   string) and went straight to ensureSDKServer's process.chdir,
   throwing ENOENT on cold containers because createMayorWorkspace
   only ran from runAgent. Exported ensureMayorWorkspaceForTown so
   prewarm materialises the workspace first.

   More critically, buildPrewarmEnv was missing GASTOWN_AGENT_ROLE,
   GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID — env vars the kilo serve
   plugin (plugin/index.ts) reads at spawn to decide whether to
   register mayor tools. Without them the prewarmed server booted with
   NO mayor tools, and the cache hit on the next /agents/start handed
   that defective instance back to the user. Now mirrors the mayor-
   shaped subset of buildAgentEnv. Added an end-to-end test that
   intercepts createKilo and asserts the env at spawn time.

4. wrangler.jsonc: lower TownContainerDO max_instances from 800 to 500.

Verified with pnpm --filter gastown-container test (67/67 pass),
pnpm --filter cloudflare-gastown typecheck, oxlint, and pnpm format.

* feat(gastown): per-route logger tagging via Hono params (review on #3158)

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
2 participants